HTMSTRIP.DOC 1 Revised: 12/05/95
The HTMSTRIP.EXE program attempts to read HTML pages, remove the HTML coding,
and write the file out as something more useful. It can be used in conjunction
with the Wayne Software PAGINATE command (available separately) to produce
properly indented text and such.
HTML codes are enclosed within <...> indicators. For upward compatibility
reasons, Web browsers ignore any codes that they don't understand and just
process the ones they can handle.
Note that the HTMSTRIP command is currently geared for handling HTML 2.0 files
and may not be all that useful when HTML 3.0 specification files come out. It
will try to process the table-specific codes which are part of HTML 3.0 as
enhanced by Netscape.
HTMSTRIP removes all HTML codes. It also handles the standard HTML "&xxx;"
replacements (e.g. "&copy;" is replaced by "(c)"). You can add or change these
replacements as desired by using the INI file (documented later).
HTMSTRIP is also tuned to specially handle several imbedded HTML codes, although
proper formatting depends on using the /PAG option. These codes are the
following:
<A ...> External link
<BLOCKQUOTE>...</BLOCKQUOTE> Indented block of text
<BR> Forced line break
<CENTER>...</CENTER> Centering text
<DIR>...</DIR> Directory list of items
</DL> End of definition list
<DT> First term of definition list/glossary
<H1> to <H6>...</H1> to </H6> Heading items
<HR> Horizontal rule
<IMG ...> Image
<INPUT ...> User input
<LI> Menu/Ordered/Unordered/Directory list item
<MENU>...</MENU> Menu listing
<OL>...</OL> Ordered listing
<P> Paragraph indicator
<PRE>...</PRE> Preserve spacing block (preformatted text)
<TABLE>...</TABLE> Table block
<TD>...</TD> Table data (cell)
<TH>...</TH> Table heading
<TITLE>...</TITLE> Title item
<TR>...</TR> Table row
<UL>...</UL> Unordered listing
If you run across other codes that become vital, let me know and I'll try to
handle them somehow.
If the /PAG option is invoked, HTMSTRIP will rewrite the file with imbedded
marker commands that are used by the PAGINATE command (also available out there
under the name PAGINymm.ZIP--see BRUCEymm.DOC for more information). These
marker commands actually try to allow the information to be displayed in its
desired format.
Once the program runs, use:
PAGINATE infile
on the file that HTMSTRIP produces (the output file for HTMSTRIP is the input
file for PAGINATE).
If the /PAG option is not invoked (/-PAG is left in effect), HTMSTRIP will still
try to process the file but the final results won't be as pleasing. /WRAP (the
default) will reflow the lines for you. Lines will also be centered where
appropriate. In many cases, this effect is more than adequate so feel free to
use it. /PAG does a better job of handling things but it requires a second pass
through the files and may not be worth your time.
How to get HTML files:
Some people who are using regular Web browsers like Mosaic or Netscape don't
realize that they're automatically saving HTML files to their hard disk
throughout every Web session. That's because just about every Web browser saves
the most-recently accessed files from the Web (including HTML source code,
GIF's, and JPG's) on your hard disk and reads them from there instead of
requiring you to download them every time you go back to a previous page. This
is typically settable by you under "Preferences" and "Cache" on your Web
browser.
I usually set my Web browser to have a huge cache, maybe 10MB. Anything beats
downloading the same pages over again even at 28.8K. And I make sure that I do
not have anything specified like "clear cache at the end of every session". Then
I just go through the files in the cache subdirectory afterward and reprocess
them.
Two disadvantages to a cache... It takes up hard disk space but, hey, the Web
browser is typically in Windows so why are you surprised. The second
disadvantage is that if the page actually changes between sessions, you
typically won't notice the new page as long as it remains in your cache. If you
think a cached page should have changed but didn't, you can typically ask your
Web browser to reload the page. On some browsers, this is shown as an arrow in
the form of a circle.
HTMSTRIP can process the entire cache subdirectory. It automatically detects
non-HTML files for you and processes accordingly. It creates new text file
versions of just the HTML pages it finds.
Specifying parameters:
Parameters for this program can be set in the following ways. The last setting
encountered always wins:
- Read from an *.INI file (see below),
- Through the use of an environmental variable (SET HTMSTRIP=whatever), or
- From the command line (see "Syntax" below)
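As a rough illustration of the "last setting wins" rule, the Python sketch
below merges switches from the three sources in the order they are listed
above. The function names and the dictionary of settings are assumptions made
for this example only; they say nothing about how HTMSTRIP is implemented, and
switches such as /Iinitfile that carry a file name are not handled.

```python
def parse_switch(token):
    """Turn one switch such as '/LENGTH=80', '/PAG' or '/-PAG' into (name, value)."""
    body = token.lstrip("/")
    if body.startswith("-"):
        return body[1:].upper(), False       # '/-PAG' switches the option off
    name, sep, value = body.partition("=")
    return name.upper(), (value if sep else True)


def merge_settings(ini_switches, env_switches, cli_switches):
    """Apply the sources in order; a later source overrides an earlier one."""
    settings = {}
    for source in (ini_switches, env_switches, cli_switches):
        for token in source:
            name, value = parse_switch(token)
            settings[name] = value
    return settings


settings = merge_settings(
    ["/EXT=.NEW", "/-PAG", "/LENGTH=80"],    # from HTMSTRIP.INI
    ["/LENGTH=72"],                           # from SET HTMSTRIP=/LENGTH=72
    ["/PAG"],                                 # from the command line
)
print(settings)   # {'EXT': '.NEW', 'PAG': True, 'LENGTH': '72'}
```

Here the command line turns /PAG back on even though the INI file turned it
off, and the environmental variable overrides the INI file's /LENGTH.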
The HTMSTRIP.INI file:
HTMSTRIP will read a HTMSTRIP.INI file if one is found. (You can specify a
different file name if desired.) The file is an ASCII text file that can be
created and maintained by hand. The file can consist of one or more command line
parameters (only those that begin with a "/"; no multi-word ones), one statement
per line. E.g.:
/EXT=.NEW
/-PAG
/LENGTH=80
The file can also contain comments which are blank lines or any line beginning
with:
; (semi-colon)
: (colon)
' (quote)
For HTMSTRIP, the file can also contain a series of lookups used to replace
occurrences of the HTML "&xxx;" characters (such as "&copy;"). A default
HTMSTRIP.INI is provided with over 120 lookups. To define or change
these lookups, the INI file should include a series of lines in the following
format:
&xxx; = outstr
where "&xxx;" is the HTML sequence and "outstr" is what you want to replace it
with. The outstr portion can consist of regular non-space ASCII text characters
(like "A" or "z") as well as hexadecimal values (in the form &Hxx) or decimal
values (in the form \nnn). It can also be the word "NULL" which translates the
string into nothing. You cannot use a space or equal sign in "outstr"; use the
hexadecimal or decimal representations instead. The table does not have to be
in any specified order. Lines can end with "/*" followed by a comment if you
want. Examples:
&copy; = (c) /* Copyright symbol
&deg; = °
&eacute; = é
&ecirc; = ê
&egrave; = è
&nbsp; = \032
Remember that "&xxx;" replacements are case-sensitive in HTML. "&deg;" will not
find "&Deg;".
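The lookup format described above can be sketched in Python as follows. This
is only an illustration under stated assumptions: the names decode_outstr and
parse_lookup_line are made up for the example, the \nnn and &Hxx codes are
taken to be IBM PC (code page 437) byte values as in the table at the end of
this document, and "NULL" is matched in upper case only.

```python
import re

# Assumption: codes are CP437 byte values. Decimal values above 255 are not
# handled here and would raise a ValueError.
CODE = re.compile(r"\\(\d{3})|&H((?:[0-9A-Fa-f]{2})+)")


def decode_outstr(outstr):
    """Expand \\nnn and &Hxx codes in an 'outstr'; 'NULL' becomes nothing."""
    if outstr == "NULL":
        return ""

    def expand(match):
        if match.group(1) is not None:
            data = bytes([int(match.group(1))])      # \nnn, decimal
        else:
            data = bytes.fromhex(match.group(2))     # &Hxx, hexadecimal
        return data.decode("cp437")

    return CODE.sub(expand, outstr)


def parse_lookup_line(line):
    """Return (entity, replacement) for '&xxx; = outstr', or None."""
    line = line.split("/*", 1)[0].strip()            # drop a trailing comment
    if not line.startswith("&") or "=" not in line:
        return None
    entity, outstr = (part.strip() for part in line.split("=", 1))
    return entity, decode_outstr(outstr)


table = dict(
    parse_lookup_line(line)
    for line in [
        "&copy; = (c) /* Copyright symbol",
        "&deg; = \\248",
        "&nbsp; = \\032",
    ]
)
print(table["&copy;"])      # (c)
print("&Deg;" in table)     # False: lookups are case-sensitive
```

Using a plain dictionary keyed on the exact "&xxx;" text gives the
case-sensitive behavior for free.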
You are also allowed to redefine the strings that are used for three symbolic
references in the file. These show up only if /SYMBOLS is specified. By
default, you will see the following:
for <A> external links -> [Link]
for <IMG> image references -> [Image]
for <INPUT> user inputs -> [Input]
You can redefine any and all of these references in the same lookup file. These
substitutions are specified more or less like the previous substitutions:
<A> = [Link]
<IMG> = [Image]
<INPUT> = [Input]
Unlike with the other lookups, the left side is not case sensitive so
"<a>=[Link]" works just fine. Hexadecimal and decimal replacements are again
acceptable. You might, for example, want to redefine them like this:
<A> = \251 /* Replaces with a √ symbol
<IMG> = \015 /* Replaces with a ☼ symbol (little flash cube)
<INPUT> = ? /* Replaces with a question mark
Any symbolic references that you do not redefine will default to their original
values. If /-SYMBOLS is specified, any symbolic definitions are ignored and a
"NULL" replacement string is used for all of them.
HTMSTRIP looks for the initialization file in your default subdirectory first.
It then searches for it in the subdirectory where the executable was and then
goes through your DOS path.
The "&xxx;" and symbolic lookup table can also be read from a different file
specified by the /Linitfile parameter.
Passing in "/-I" or "/INULL" skips loading the INI file. This saves some
execution time as the program does not need to search your path for the file.
You can combine *.INI files from this and other routines I have out there. This
is useful if you're tired of having a lot of *.INI files out there. To do this,
make a single *.INI file (such as ALL.INI) and include blocks in it. The
routine will look for the block that's the name of the core routine (in this
case, "[HTMSTRIP]") and only processes the records within that block. For
example,
; ALL.INI -- contains all of the INI statements
[DATES]
/SORT
[FILL]
/ON
/SPLIT
[HTMSTRIP]
/EXT=.NEW
/LHTMSTRIP.INI
(all of the lookups)
You can either pass in the name of the INI file ("/IALL.INI") or the routine
will use a "SET BG=filename" (e.g. "SET BG=ALL.INI") parameter if one is
provided.
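The block selection in a combined *.INI file might look something like the
Python sketch below. The function name select_block is hypothetical. Two
behaviors are assumptions rather than anything stated here: a file with no
"[...]" blocks at all is treated as applying entirely to HTMSTRIP, and block
names are compared without regard to case.

```python
def select_block(lines, block="HTMSTRIP"):
    """Return the statements in `lines` that belong to `block`."""
    lines = [line.strip() for line in lines]
    has_blocks = any(line.startswith("[") for line in lines)
    selected = []
    current = None
    for line in lines:
        if not line or line[0] in ";:'":
            continue                          # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].upper()      # "[HTMSTRIP]" -> "HTMSTRIP"
            continue
        if not has_blocks or current == block:
            selected.append(line)
    return selected


all_ini = """\
; ALL.INI -- contains all of the INI statements
[DATES]
/SORT
[FILL]
/ON
/SPLIT
[HTMSTRIP]
/EXT=.NEW
/LHTMSTRIP.INI
"""
print(select_block(all_ini.splitlines()))   # ['/EXT=.NEW', '/LHTMSTRIP.INI']
```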
Syntax:
HTMSTRIP { filespec | @listfile } [ outfile ] [ /EXT=.xxx ]
[ /PAG | /-PAG ] [ /LENGTH=n ] [ /WRAP | /-WRAP ] [ /SYMBOLS | /-SYMBOLS ]
[ /SPACES | /-SPACES ] [ /WARNINGS | /-WARNINGS ]
[ /TAB=n ] [ /RULE=string ]
[ /ALIGN | /JUSTIFY | /-ALIGN ] [ /TITLE | /-TITLE ]
[ /TABLES | /-TABLES ] [ /PAGE=n ]
[ /Iinitfile | /-I ] [ /Linitfile ] [ /? ] [ /?&H ]
where:
"filespec" tells the routine which file or files are to be processed. The
specification can include path and wildcards if desired. Typically, the file
names are *.HTM files.
"@listfile" allows you to have a variety of file specifications saved in a text
file named "listfile". Each line in the file should consist of one file
specification, each of which can include a path and wildcards if desired. Blank
lines and lines beginning with semi-colons, colons, or quotes are ignored.
"outfile" is the name of the output file to create. If no output file name is
provided, the routine will use the infile and provide an extension of *.OUT.
(The default .OUT extension can be overridden using the /EXT=.xxx parameter.) An
outfile cannot be specified if wildcards or @listfile are used for the input
file specification.
"/EXT=.xxx" allows you to specify a different default file extension for the
output file. This parameter only matters if you do not explicitly specify an
output file name. The default value is "/EXT=.OUT".
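A small Python sketch of how the default output name could be derived from the
input name. The function name is hypothetical, and it assumes that the new
extension replaces the input file's extension rather than being added after
it.

```python
from pathlib import PureWindowsPath


def default_output_name(infile, ext=".OUT"):
    """Swap the extension of a DOS-style file name for `ext`."""
    return str(PureWindowsPath(infile).with_suffix(ext))


print(default_output_name(r"C:\CACHE\PAGE.HTM"))      # C:\CACHE\PAGE.OUT
print(default_output_name("INDEX.HTM", ext=".NEW"))   # INDEX.NEW
```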
"/PAG" says to imbed PAGINATE commands in the output file. Doing so requires
that you pass the resulting file through the PAGINATE command as well.
"/-PAG" says to not imbed PAGINATE commands. This is initially the default.
HTMSTRIP will do a reasonable job of reformatting your file anyway.
"/LENGTH=n" specifies the desired length for alignment, justification, and
centering. It is also used for the title. If /WRAP is in effect, lines will be
wrapped according to this length. Initially defaults to /LENGTH=80.
"/WRAP" says to wrap lines in the output file that are longer than the /LENGTH=n
specification. If the PAGINATE command is used on the output file, /WRAP won't
have much effect. This is initially the default.
"/-WRAP" says to skip wrapping.
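To give a feel for what wrapping and centering to a given /LENGTH=n amount to,
here is a minimal Python sketch built on the standard textwrap module. It is
an approximation for illustration only; the function names are made up and
HTMSTRIP's own line-breaking rules may well differ.

```python
import textwrap


def reflow(text, length=80):
    """Rewrap one paragraph so that no line is longer than `length`."""
    return textwrap.wrap(" ".join(text.split()), width=length)


def center(text, length=80):
    """Center a line within `length` columns, without trailing blanks."""
    return text.strip().center(length).rstrip()


paragraph = "HTML codes are surrounded within <...> indicators. " * 3
for line in reflow(paragraph, length=40):
    print(line)
print(center("Heading", length=40))
```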
"/SYMBOLS" says to allow (unless redefined in your INI file) the "[Link]",
"[Image]", and "[Input]" indicators. This is initially the default.
"/-SYMBOLS" skips the indicators even if they're defined in your INI file.
"/SPACES" keeps the extra vertical spacing between sections. There are
frequently lots of extra blank lines that appear in the output file either due
to specific HTML requests or to ensure proper reformatting. Specifying /SPACES
allows these to stay there.
"/-SPACES" removes these extra blank lines. This is initially the default.
"/WARNINGS" displays warnings when HTMSTRIP finds either internal problems in
the document or things it can't handle. This is initially the default.
"/-WARNINGS" turns off the warning messages.
"/TAB=n" specifies that if /-PAG is used and table cells are encountered, they
should be spaced according to tab positions set every n-characters. Initially
defaults to /TAB=10. If you have some big tables and they look like hell,
increase the /TAB=n setting and see what happens. Conversely, if you have
tables with a lot of cells per row, reduce the setting and see what happens.
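The trade-off behind /TAB=n can be seen in a short Python sketch that places
each table cell at the next tab position. This is a guess at the general idea,
not HTMSTRIP's actual layout code, and layout_row is a hypothetical name.

```python
def layout_row(cells, tab=10):
    """Place each cell at the next tab stop, leaving at least one blank."""
    line = ""
    for cell in cells:
        if line:
            next_stop = (len(line) // tab + 1) * tab
            line = line.ljust(next_stop)
        line += cell
    return line


print(layout_row(["Year", "Sales", "Profit"], tab=10))
print(layout_row(["1995", "A long cell value", "7"], tab=10))
print(layout_row(["1995", "A long cell value", "7"], tab=20))
```

With tab=10 the long cell in the second row pushes "7" out of line with
"Profit"; with tab=20 every cell fits inside its own column, at the cost of a
wider row.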
"/RULE=string" specifies that a string is to be repeated the length of the line.
This is used to separate sections. The string can be a single character (like
"/RULE=-"), multiple characters (like "/RULE="- ""), it can contain decimal and
hexadecimal characters (like "/RULE=\066\097\116"), it can be "/RULE=NULL"
(which typically results in a blank line), or just simply "/RULE" (which is the
same thing as "/RULE=-"). Personally, if your printer supports IBM graphics
characters, I find /RULE=\196 to be the most pleasing of the rule lines.
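Repeating a rule string to the length of the line can be sketched in Python as
below. The function name rule_line is hypothetical, and cutting the last
repetition short so that the line is exactly the requested length is an
assumption. The string is expected to be non-empty and already decoded from
any \nnn or &Hxx codes.

```python
def rule_line(string="-", length=80):
    """Repeat `string` across `length` columns; 'NULL' gives a blank line."""
    if string == "NULL":
        return ""
    repeats = -(-length // len(string))       # ceiling division
    return (string * repeats)[:length]


print(rule_line("-", length=20))    # --------------------
print(rule_line("- ", length=7))    # - - - -
```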
"/ALIGN" specifies that, if PAGINATE is used, text is to be aligned to the
specified length. This is initially the default.
"/JUSTIFY" specifies that, if PAGINATE is used, text is to be justified (both
margins equal) to the specified length. This is not recommended as it causes
some really bizarre formatting to show up.
"/-ALIGN" (or "/-JUSTIFY") says to not align or justify the text. This is not
recommended unless you bring the text into a word processor because the lines
are incredibly long.
"/TITLE" turns on, if PAGINATE is used, the more or less default title in
PAGINATE:
# title center length=80
^O%12% ^B%3% Revised: ^A
# end
"/-TITLE" skips the titles. This is initially the default.
"/TABLES" specifies that the program should try to process built-in tables.
These were introduced in the HTML 3.0 specification as expanded by Netscape. If
/TABLES is specified, the program will generate ASCII-delimited data rows which
PAGINATE can process for you, resulting in decently aligned columns and such.
There are cases where trying to process these will cause PAGINATE to fail and
you may need to turn off this option. PAGINATE will take considerably longer to
execute if tables are encountered. The main purpose of this option is to handle
numeric tables, not things like paragraphs stuck into tables. The program will
turn off the /TABLES option itself and reprocess the file if it finds that the
table is too complex for it to handle. Initially defaults to /TABLES.
"/-TABLES" skips trying to specially process tables.
"/PAGE=n" specifies the default page length. If you're using a LaserJet, you
will typically want to set this to /PAGE=59. Initially defaults to /PAGE=0
(which is the same thing as /-PAGE).
"/Iinitfile" says to read an initialization file with the file name "initfile".
The file specification *must* contain a period. If no drive or path information
is specified, the program will search for initfile beginning in your default
subdirectory and then going throughout your DOS path. The use of an
initialization file is optional. Initially defaults to "/IHTMSTRIP.INI".
"/-I" (or "/INULL") says to skip loading the initialization file.
"/Linitfile" says that the "&xxx;" and "<A>" etc. lookup codes are found in a
file other than the default "/Iinitfile" file. This is primarily useful if
you want to have a master *.INI file and a separate code lookup table.
"/?" or "/HELP" or "HELP" shows you the syntax for the command.
"/?&H" gives you a hexadecimal and decimal conversion table.
Author:
This program was written by Bruce Guthrie of Wayne Software. It is free for use
and redistribution provided relevant documentation is kept with the program, no
changes are made to the program or documentation, and it is not bundled with
commercial programs or charged for separately. People who need to bundle it in
for-sale packages must pay a $50 registration fee to "Wayne Software" at the
following address.
Additional information about this and other Wayne Software programs can be found
in the file BRUCEymm.DOC which should be included in the original ZIP file.
("ymm" is replaced by the last digit of the year and the two digit month of the
release. BRUCE508.DOC came out in August 1995. This same naming convention is
used in naming the ZIP file that this program was included in.) Comments and
suggestions can also be sent to:
Bruce Guthrie
Wayne Software
113 Sheffield St.
Silver Spring, MD 20910
fax: (301) 588-8986
See BRUCEymm.DOC file for additional contact information.
Foreign users: Please provide an Internet e-mail address in all correspondence.
Decimal and hexadecimal codes:
e.g. "\066\097\116" and "&H426174" both are "Bat"
+---------------------------------------------------------------------------
| dec hex chr | dec hex chr | dec hex chr | dec hex chr | dec hex chr |
+--------------+--------------+--------------+--------------+--------------+
| \000 &H00 nul| \052 &H34 4 | \104 &H68 h | \156 &H9C £ | \208 &HD0 ╨ |
| \001 &H01 | \053 &H35 5 | \105 &H69 i | \157 &H9D ¥ | \209 &HD1 ╤ |
| \002 &H02 | \054 &H36 6 | \106 &H6A j | \158 &H9E ₧ | \210 &HD2 ╥ |
| \003 &H03 | \055 &H37 7 | \107 &H6B k | \159 &H9F ƒ | \211 &HD3 ╙ |
| \004 &H04 | \056 &H38 8 | \108 &H6C l | \160 &HA0 á | \212 &HD4 ╘ |
| \005 &H05 | \057 &H39 9 | \109 &H6D m | \161 &HA1 í | \213 &HD5 ╒ |
| \006 &H06 | \058 &H3A : | \110 &H6E n | \162 &HA2 ó | \214 &HD6 ╓ |
| \007 &H07 bel| \059 &H3B ; | \111 &H6F o | \163 &HA3 ú | \215 &HD7 ╫ |
| \008 &H08 bs | \060 &H3C < | \112 &H70 p | \164 &HA4 ñ | \216 &HD8 ╪ |
| \009 &H09 tab| \061 &H3D = | \113 &H71 q | \165 &HA5 Ñ | \217 &HD9 ┘ |
| \010 &H0A lf | \062 &H3E > | \114 &H72 r | \166 &HA6 ª | \218 &HDA ┌ |
| \011 &H0B vt | \063 &H3F ? | \115 &H73 s | \167 &HA7 º | \219 &HDB █ |
| \012 &H0C pg | \064 &H40 @ | \116 &H74 t | \168 &HA8 ¿ | \220 &HDC ▄ |
| \013 &H0D cr | \065 &H41 A | \117 &H75 u | \169 &HA9 ⌐ | \221 &HDD ▌ |
| \014 &H0E | \066 &H42 B | \118 &H76 v | \170 &HAA ¬ | \222 &HDE ▐ |
| \015 &H0F | \067 &H43 C | \119 &H77 w | \171 &HAB ½ | \223 &HDF ▀ |
| \016 &H10 | \068 &H44 D | \120 &H78 x | \172 &HAC ¼ | \224 &HE0 α |
| \017 &H11 | \069 &H45 E | \121 &H79 y | \173 &HAD ¡ | \225 &HE1 ß |
| \018 &H12 | \070 &H46 F | \122 &H7A z | \174 &HAE « | \226 &HE2 Γ |
| \019 &H13 | \071 &H47 G | \123 &H7B { | \175 &HAF » | \227 &HE3 π |
| \020 &H14 | \072 &H48 H | \124 &H7C | | \176 &HB0 ░ | \228 &HE4 Σ |
| \021 &H15 | \073 &H49 I | \125 &H7D } | \177 &HB1 ▒ | \229 &HE5 σ |
| \022 &H16 | \074 &H4A J | \126 &H7E ~ | \178 &HB2 ▓ | \230 &HE6 µ |
| \023 &H17 | \075 &H4B K | \127 &H7F | \179 &HB3 │ | \231 &HE7 τ |
| \024 &H18 | \076 &H4C L | \128 &H80 Ç | \180 &HB4 ┤ | \232 &HE8 Φ |
| \025 &H19 | \077 &H4D M | \129 &H81 ü | \181 &HB5 ╡ | \233 &HE9 Θ |
| \026 &H1A eof| \078 &H4E N | \130 &H82 é | \182 &HB6 ╢ | \234 &HEA Ω |
| \027 &H1B esc| \079 &H4F O | \131 &H83 â | \183 &HB7 ╖ | \235 &HEB δ |
| \028 &H1C | \080 &H50 P | \132 &H84 ä | \184 &HB8 ╕ | \236 &HEC ∞ |
| \029 &H1D ???| \081 &H51 Q | \133 &H85 à | \185 &HB9 ╣ | \237 &HED φ |
| \030 &H1E ???| \082 &H52 R | \134 &H86 å | \186 &HBA ║ | \238 &HEE ε |
| \031 &H1F ???| \083 &H53 S | \135 &H87 ç | \187 &HBB ╗ | \239 &HEF ∩ |
| \032 &H20 sp | \084 &H54 T | \136 &H88 ê | \188 &HBC ╝ | \240 &HF0 ≡ |
| \033 &H21 ! | \085 &H55 U | \137 &H89 ë | \189 &HBD ╜ | \241 &HF1 ± |
| \034 &H22 " | \086 &H56 V | \138 &H8A è | \190 &HBE ╛ | \242 &HF2 ≥ |
| \035 &H23 # | \087 &H57 W | \139 &H8B ï | \191 &HBF ┐ | \243 &HF3 ≤ |
| \036 &H24 $ | \088 &H58 X | \140 &H8C î | \192 &HC0 └ | \244 &HF4 ⌠ |
| \037 &H25 % | \089 &H59 Y | \141 &H8D ì | \193 &HC1 ┴ | \245 &HF5 ⌡ |
| \038 &H26 & | \090 &H5A Z | \142 &H8E Ä | \194 &HC2 ┬ | \246 &HF6 ÷ |
| \039 &H27 ' | \091 &H5B [ | \143 &H8F Å | \195 &HC3 ├ | \247 &HF7 ≈ |
| \040 &H28 ( | \092 &H5C \ | \144 &H90 É | \196 &HC4 ─ | \248 &HF8 ° |
| \041 &H29 ) | \093 &H5D ] | \145 &H91 æ | \197 &HC5 ┼ | \249 &HF9 ∙ |
| \042 &H2A * | \094 &H5E ^ | \146 &H92 Æ | \198 &HC6 ╞ | \250 &HFA · |
| \043 &H2B + | \095 &H5F _ | \147 &H93 ô | \199 &HC7 ╟ | \251 &HFB √ |
| \044 &H2C , | \096 &H60 ` | \148 &H94 ö | \200 &HC8 ╚ | \252 &HFC ⁿ |
| \045 &H2D - | \097 &H61 a | \149 &H95 ò | \201 &HC9 ╔ | \253 &HFD ² |
| \046 &H2E . | \098 &H62 b | \150 &H96 û | \202 &HCA ╩ | \254 &HFE ■ |
| \047 &H2F / | \099 &H63 c | \151 &H97 ù | \203 &HCB ╦ | \255 &HFF |
| \048 &H30 0 | \100 &H64 d | \152 &H98 ÿ | \204 &HCC ╠ | |
| \049 &H31 1 | \101 &H65 e | \153 &H99 Ö | \205 &HCD ═ | |
| \050 &H32 2 | \102 &H66 f | \154 &H9A Ü | \206 &HCE ╬ | |
| \051 &H33 3 | \103 &H67 g | \155 &H9B ¢ | \207 &HCF ╧ | |
+--------------+--------------+--------------+--------------+--------------+
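If you need a code that you can't find quickly in the table, the conversion is
easy to reproduce. The Python sketch below assumes that the table follows the
IBM PC character set (code page 437), which is what the characters shown
suggest; the function names are made up for the example. Note that Python's
cp437 codec treats codes 0 to 31 as control characters rather than the IBM
graphics symbols.

```python
def to_codes(text):
    """Return the decimal and hexadecimal spellings of `text`."""
    data = text.encode("cp437")
    decimal = "".join(f"\\{byte:03d}" for byte in data)
    hexadecimal = "&H" + data.hex().upper()
    return decimal, hexadecimal


def table_entry(code):
    """Format one 'dec hex chr' entry for a code from 32 to 255."""
    char = bytes([code]).decode("cp437")
    return f"\\{code:03d} &H{code:02X} {char}"


print(to_codes("Bat"))      # ('\\066\\097\\116', '&H426174')
print(table_entry(66))      # \066 &H42 B
print(table_entry(248))     # \248 &HF8 °
```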